Final Project - Population Genetic Structure of the Sea Urchin Tripneustes gratilla
Introduction
This document is for my final project on population genetics of Tripneustes gratilla with data received from Genbank. A total of 222 unique GenBank submissions were curated for this project. Each entry was confirmed to contain both a high-quality Cytochrome Oxidase I (COI) sequence and corresponding geographic location data.
Photo of Collector Sea Urchin:
Research Question:
Is there genetic population structure in Tripneustes gratilla across its sampled Indo-Pacific range?
Population structure is the non-random distribution of genetic variation within a species, showing that the population is subdivided rather than being one freely interbreeding group.
In a structured population, this division is typically caused by one or more of the following factors:
Limited Gene Flow: Physical barriers (such as ocean currents, landmasses, or immense distance) restrict the movement and interbreeding of individuals.
Genetic Drift: The random change in allele frequencies, which is often amplified within small, isolated populations.
Local Adaptation: Natural selection favoring different heritable traits in distinct local environments, reinforcing genetic differences between populations.
Load Libraries
These are the libraries I used to complete this assignment.
Code
### Core Libraries Used (Setup, and Data Handling)
library(tidyverse) # Includes ggplot2, readr, dplyr, tidyr
library(knitr) # For general knitting options
library(here) # For reliable file path construction
library(readr) # For reading the cleaned data
library(dplyr) # For general data manipulation
### Libraries for Output 1: Interactive Table (DT)
library(DT) # For creating the interactive datatable()
library(htmltools) # Needed to render HTML captions/titles in DT
library(stringr) # Useful for string manipulation in table prep
### Libraries for Output 2: Structure Bar Plot (DAPC & Phylogenetics)
#New libraries I learned: line 14 to 18
library(Biostrings) # For handling DNA sequences (DNAStringSet)
library(msa) # For sequence multiple alignment (ClustalW)
library(ape) # For sequence data structure (as.DNAbin)
library(adegenet) # For DAPC and genind objects
library(ade4) # Underlying package for PCA/DAPC
library(ggplot2) # For plotting the final bar plot
library(ggtext) # For advanced text formatting (e.g., italics in titles)
### Libraries for Output 3: Map
library(sf) # For working with spatial features
library(rnaturalearth) # For global map boundaries
library(rnaturalearthdata) # For detailed map data
library(ggrepel) # For labeling points on the map without overlap
library(plotly) # Interactive plottingCustomizing Quarto Theme
In one of my Tidy_Tuesday assignments I learned how to use SCSS (Sass) to customize Quarto’s built-in theme (cosmo). SCSS is a preprocessor scripting language that compiles into standard CSS, allowing for cleaner code and the use of variables.
I created a separate file named styles.scss and linked it in the YAML header (theme: [cosmo, styles.scss]). The code for this customization is located in my script folder if you want to view it.
Data Preparation and Cleaning
The dataset was curated from publicly available GenBank submissions. Initial screening ensured that all records contained the necessary metadata, specifically COI (Cytochrome Oxidase Subunit I) sequences, along with associated collection dates and geographic locations.
I loaded and cleaned this raw data using a script (ask author Leah Barkai for script). This script prepared the Tripneustes gratilla COI sequence data for downstream genetic analysis by performing the following core steps:
What I Did to Clean My Data:
- Read each GenBank file into R as a character vector
- Used a loop to extract each record between LOCUS and //
- Extracted metadata fields: Locus, Date, Country, Coordinates, Species, Sequence
- Cleaned sequences by:
- Removing all numbers and whitespace
- Converting to uppercase letters
- Removing non-DNA characters (kept only A, C, G, T, N = unknown nucleotide)
- Keeping only sequences between 550–700 bp Cleaned metadata by:
- Cleaned metadata by:
- Standardizing country names (correcting mislabeled regions)
- Ensuring the Species column existed even when missing
- Removing known problematic/outlier sequences (e.g., MK609485)
- Cleaned dates by:
- Pulling dates from
/collection_datefirst - Using
/dateor the LOCUS line if missing - Formatting everything to YYYY-MM-DD
- Pulling dates from
- Final data preparation:
- Removed duplicate Locus entries
- Removed rows missing Country
- Combined all cleaned datasets into one table
- Saved output as Cleaned_Tripneustes_CO1_Data.csv
Read in Data
Code
Tripneustes_C01_data <- read_csv(here("Data", "Cleaned_Tripneustes_CO1_Data.csv"))
glimpse(Tripneustes_C01_data) #Glimpse of the dataRows: 222
Columns: 5
$ Locus <chr> "AY205374", "AY205375", "AY205392", "AY205393", "AY2…
$ Collection_Date <date> 2016-07-25, 2016-07-25, 2016-07-25, 2016-07-25, 201…
$ Sequence <chr> "ATCCTATTTCAACATCTATTCTGATTTTTTGGTCACCCGGAAGTCTACATC…
$ Sampling_Location <chr> "Chile", "Chile", "Chile", "Chile", "Chile", "Chile"…
$ Species <chr> "Tripneustes gratilla", "Tripneustes gratilla", "Tri…
Data Dictionary
The following table serves as the Data Dictionary, defining the variables contained in the dataset used for analysis. This information is sourced from the file Final_Project_Data_Dictionary.csv.
| Name | Plot_name | Data_Type | Description | Units |
|---|---|---|---|---|
| Locus | Sequence ID Number | Character | Each locus has a unique sequence ID number from GenBank. It allows you to identify and retrieve that specific sequence record. | Unique ID Number given by GenBank |
| Collection_Date | Collection Date | Character | The date the sample was collected. The date is stored as a character string formatted to the ISO 8601 standard (YYYY-MM-DD), ensuring consistency and ease of data analysis later on. | YYYY-MM-DD |
| Sequence | CO1 DNA Sequence | Character | This is the contiguous DNA sequence of the Cytochrome Oxidase 1 (CO1) gene region. It was obtained using a CO1 primer. | Base Pairs (bp) |
| Sampling_Location | Sample Location | Character | The geographic region where the sample was collected. | Geographic Name |
| Species | Species Name | Character | This is the species name the person who uploaded the locus sequence defined the species as. | Binomial Nomenclature |
Output #1: Interactive Table
This output provides an interactive representation of the cleaned Tripneustes gratilla sequence dataset, allowing for dynamic viewing, filtering, and sorting of the COI records and associated metadata.
Code
# 1. Prepare data by shortening sequences for table display
Data_Table <- Tripneustes_C01_data %>% # Table of Data
mutate(Sequence_Short = paste0( # Creating new column to have first 50 sequences showing
"<span title='", Sequence, "'>", # Able to expand the sequence to see the whole thing
substr(Sequence, 1, 50), # First 50 characters
ifelse(nchar(Sequence) > 50, "...", ""), # Seeing how long the sequence is
"</span>")) %>% # Closing the span tag
select(Locus, Collection_Date, Sequence_Short, Sampling_Location) # Using these columns for the table
# 2. Table caption
table_caption <- tags$caption( # Making table caption
style = 'caption-side: top; text-align: left;', # Location of caption
HTML("Table 1: GenBank Sequence Records for <i>Tripneustes gratilla</i> (Hover over the sequence to view the full length.)") # Caption to tell people they can hover over it to get the full thing
)
# 3. Make the interactive datatable
datatable(Data_Table, # Creating the interactive table
escape = FALSE, # Allow rendering
rownames = FALSE, # Hide row names
colnames = c( # Column header names that I put in the data dictionary
"Sequence ID Number",
"Collection Date",
"CO1 DNA Sequence",
"Sampling Location"),
options = list( # Table set up
pageLength = 10, # 10 rows per page
autoWidth = TRUE, # Width adjustment
scrollX = TRUE, # Allow horizontal scrolling
columnDefs = list( # Column widths
list(width = '10%', targets = c(0, 1)),
list(width = '40%', targets = 2))),
caption = table_caption) # Add captionTable 1. Sample locations were compiled from GenBank and published literature. The reported geographic level varies, including countries, territories, and provinces, representing the finest resolution available for each sample. Similarly, the reported date corresponds to the collection date where available; otherwise, the GenBank submission date or publication date was used as a proxy.
Output #2: Genetic Structure Bar Plot
This section of code performs the Discriminant Analysis of Principal Components (DAPC), the main DNA analysis used to find the global genetic families. The goal is to first determine the optimal number of genetic clusters (\(K\)) by examining the ‘elbow’ in the BIC plot, and then to sort the urchins into their most likely number of genetic clusters, denoted by \(K\). The final bar plot visually shows the percentage of each of these \(K\) genetic families found at every sampled location, revealing the global population structure.
Code to align sequences:
Code
# 1. Data Cleaning
dapc_df <- Tripneustes_C01_data %>% #Making a new data frame
# Start with raw data frame
filter(!is.na(Sequence), Sequence != "") %>% # Remove rows with NAs, missing or empty
mutate(Individual_ID = paste0("Seq_", row_number()), #Unique ID for each sequence
Original_Locus = Locus) #Use the sequence ID number aka the locus
seqs <- DNAStringSet(dapc_df$Sequence) #Converting sequence column to DNAStringSet object (required for alignment)
names(seqs) <- dapc_df$Individual_ID # Name the sequences using the new unique Individual_ID
# 2. Alignment - just a heads up - this can take a while to perform
aln <- msa(seqs, method="ClustalW", substitutionMatrix=NULL, verbose=FALSE)use default substitution matrix
Code to make the elbow plot to find K:
Code
# Perform Multiple Sequence Alignment (MSA) using ClustalW
# Does pairwise alignment which algins two sequences at a time then calculates a score to see how similar the pair is
bin <- as.DNAbin(aln) # Convert alignment output to DNAbin format (required for 'adegenet' package)
# 3. Genind - object that is designed to hold indidivual genetic info alongside metadata
gen <- DNAbin2genind(bin, # Convert DNAbin to genind object
pop = dapc_df$Sampling_Location) # Grouping by sampling location
# 4. Checking
if (nInd(gen) != nrow(dapc_df)) #Checking if individual counts in genind match input rows
stop("Mismatch between genind individuals and input rows") #Stop if mismatched
var_sites <- sum(apply(gen@tab, 2, function(x) length(unique(x)) > 1)) #Checking variation - number of variable sites
if (var_sites < 5) stop("Too few polymorphic sites for DAPC") #Stop if too few sites
gen@tab <- tab(gen, NA.method="mean") #
# Impute missing genotypes with the mean allele frequency
# 5. PCA Analysis and Elbow plot
pca_out <- dudi.pca(gen, scale=FALSE, # Run Principal Component Analysis (PCA) on genind object
nf=10, scannf=FALSE) # Retain 10 Principal Components (PCs) for clustering
max_k <- 10 # Set the maximum number of clusters (K) to test
wss <- sapply(1:max_k, function(k){ # Calculate Within Sum of Squares (WSS) for K-means clustering
kmeans(pca_out$li, centers=k, nstart=20)$tot.withinss # WSS for each K from 1 to max_k, based on PCA scores
})
optimal_k <- 3 # Manually set the optimal number of clusters chosen from the Elbow Plot
plot(1:max_k, wss, type="b", #Plot with the scores and K group
main="Elbow Plot",
xlab="Number of Clusters (K)",
ylab="Total Within Sum of Squares (WSS)")
#Showing the K I picked
wss_at_k3 <- wss[optimal_k]
points(x = optimal_k, #Putting a red dot on 3
y = wss_at_k3,
pch = 21,
bg = "red",
cex = 2) Code
# 6. Final structure bar plot
set.seed(42) #For reproducibility
km <- kmeans(pca_out$li, centers=optimal_k, nstart=20) # Run K-means clustering on the PCA scores
dapc_res <- dapc(gen, pop=factor(km$cluster), n.pca=10, n.da=optimal_k-1) # Check how well these groups separate using the full genetic dataI chose \(K=3\) highlighted in red because the WSS, which measures how compact the clusters are, stopped decreasing sharply after this point. This means \(K=3\) is the best-fitting number of distinct genetic groups.
Code to make the structure bar plot:
Code
# 1. Converting DAPC results into a long tidy format for plotting
assignments_long <- as.data.frame(dapc_res$posterior) %>% #Taking the cluster probabilities from DAPC
tibble::rownames_to_column("Individual_ID") %>% #Turning IDs into a column
pivot_longer( #Changing from wide to long format
cols = -Individual_ID, # Taking all columns except Individual_ID
names_to = "Cluster", #New column name for cluster number
values_to = "Membership_Prob" ) %>% # New column name for probability score (0 to 1)
left_join(dapc_df %>% select(Individual_ID, Sampling_Location), by = "Individual_ID") %>%
mutate(Cluster = as.integer(Cluster)) # Adding location data to the probabilities
# a) Identify the random cluster ID that corresponds to the South Africa population in this run
saf_original_id <- assignments_long %>%
filter(Sampling_Location == "South Africa") %>%
group_by(Cluster) %>%
summarise(Mean_Prob = mean(Membership_Prob), .groups = 'drop') %>%
slice_max(Mean_Prob) %>%
pull(Cluster)
# b) Identify the other two original cluster IDs
other_original_ids <- assignments_long %>%
distinct(Cluster) %>%
filter(Cluster != saf_original_id) %>%
pull(Cluster) %>%
sort()
#c) Apply the fixed labels (1 = SAF, 2 & 3 are assigned sequentially)
assignments_long <- assignments_long %>%
mutate(Cluster = case_when(
Cluster == saf_original_id ~ "1",
Cluster == other_original_ids[1] ~ "2",
Cluster == other_original_ids[2] ~ "3",
TRUE ~ as.character(Cluster))) %>%
mutate(Cluster = factor(Cluster, levels = c("1", "2", "3")))
# 2. Data prep for plot
custom_location_order <- c( #Listing the sampling locations from West to East
"Madagascar", "Reunion (France)", "Oman", "South Africa",
"Indonesia", "Philippines", "Japan", "Guam", "Papua New Guinea",
"Hawaii", "Kiritimati", "French Polynesia", "Chile")
abbreviation_map <- data.frame( #Assigning abbreviations to the locations so not so much going on
Sampling_Location = custom_location_order,
Abbreviation = c("MAD", "REU", "OMA", "SAF", "IDN", "PHL", "JPN", "GUM", "PNG", "HAW", "KIR", "PYF", "CHL"))
# 3. Calculating average genetic composotion for each sampling location
mean_assignments_loc <- assignments_long %>%
dplyr::group_by(Sampling_Location, Cluster) %>%
dplyr::summarise(Mean_Membership_Prob = mean(Membership_Prob), .groups = "drop") %>%
dplyr::mutate(Sampling_Location = factor(Sampling_Location, levels = custom_location_order),
Cluster = factor(Cluster, levels = c(1, 2, 3)) ) %>%
dplyr::left_join(abbreviation_map, by = "Sampling_Location") %>%
dplyr::mutate(Abbreviation = factor(Abbreviation, levels = abbreviation_map$Abbreviation))
# Save final assignments for map
final_cluster_assignments <- mean_assignments_loc %>%
group_by(Sampling_Location) %>%
slice_max(Mean_Membership_Prob) %>%
ungroup() %>%
# **NEW: Standardize location names for map join**
mutate(Sampling_Location = case_when(
Sampling_Location == "Reunion (France)" ~ "Reunion",
Sampling_Location == "Kiritimati" ~ "Kiribati",
Sampling_Location == "South Africa" ~ "Kwazulu-Natal",
TRUE ~ Sampling_Location),
Genetic_Cluster = Cluster) %>%
select(Sampling_Location, Genetic_Cluster)
# 4. Making the plot settings
plot_title <- paste0("Genetic Structure of *Tripneustes gratilla* by Location (K=", optimal_k, ")") #Title
plot_subtitle <- "Clustering based on CO1 sequences" #Subtitle
# Define custom colors for each cluster - colors are color blind friendly
k3_colors <- c("1" = "#E69F00", # orange
"2" = "#006D2C", # green
"3" = "#0072B2") # blue
# 5. Putting the plot all together
simplified_structure_plot <- ggplot2::ggplot(
mean_assignments_loc,
aes(x = Abbreviation, y = Mean_Membership_Prob, fill = Cluster)) + # Map location to X, probability to Y, and cluster to color
ggplot2::geom_bar(stat = "identity", width = 1) + # Draw stacked bars (height based on Y value); width=1 removes gaps
ggplot2::scale_fill_manual(values = k3_colors) + # Colors defined before
ggplot2::scale_x_discrete(expand = c(0, 0)) + # Remove padding on the X-axis
ggplot2::scale_y_continuous( # Setting for Y-axis
expand = c(0, 0), # Remove padding on the Y-axis
limits = c(0, 1.05), # Set Y-axis to go slightly above 1
breaks = seq(0, 1, by = 0.25), # Set major tick marks at 0, 0.25, 0.5, 0.75, 1
labels = c("0.00", "", "", "", "1.00")) + # Only label the 0 and 1 points clearly
ggplot2::labs(
title = plot_title, # Add title
subtitle = plot_subtitle, # Add subtitle
x = "Sampling Location (Ordered West to East)", # Label X-axis
y = NULL, # Set Y-axis label to be empty
fill = "Genetic Cluster") + # Label color legend
ggplot2::theme_minimal() + # Clean minimal theme
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold"), # Center and bold the plot title
plot.subtitle = element_text(hjust = 0.5, size = 10),
axis.text.y = element_text(size = 10),
axis.ticks.y = element_line(linewidth = 0.5, colour = "black"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 11), # Rotate X-axis labels for better fit
panel.grid.major.y = element_line(color = "gray80", linetype = "dotted"), # Keep faint horizontal grid lines
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank(), # Remove vertical grid lines
plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
#5. Save plot
ggplot2::ggsave(
filename = "structure_plot_simplified_mean_loc.png",
plot = simplified_structure_plot, #Name
path = here::here("Output"), # Use here function to save
width = 8, height = 5, units = "in")
# 6. Display the plot
plot(simplified_structure_plot)Figure 1. Mean Genetic Structure of Tripneustes gratilla by Location (\(K=3\)). The Y-axis represents the Mean Membership Probability (Admixture Proportion), where the height of each colored segment indicates the average proportion of the assigned genetic cluster in that respective sampling location. Abbreviations (West to East): MAD, Madagascar; REU, Reunion (France); OMA, Oman; SAF, South Africa; IDN, Indonesia; PHL, Philippines; JPN, Japan; GUM, Guam; PNG, Papua New Guinea; HAW, Hawaii; KIR, Kiritimati; PYF, French Polynesia; CHL, Chile. Data Source: GenBank CO1 Sequences.”
Figure 1 Summary: The analysis strongly supports the presence of significant population genetic structure in Tripneustes gratilla, as demonstrated by the clear partitioning into three distinct genetic clusters (\(K=3\)). These clusters map primarily to different geographic regions, indicating that gene flow is restricted due to geographic and oceanographic barriers across the species’ range prevent the mixing of individuals, allowing populations to diverge genetically. A particularly notable finding is the high degree of isolation of the South Africa population, which appears largely assigned to a single, unique cluster. However, a primary limitation of this study is the variable and often low sample size across locations, meaning that sites with fewer samples may not fully represent the true genetic diversity of that population.
Output #3: Map
This section visualizes the spatial distribution of the \(K=3\) genetic clusters identified in the DAPC analysis. The code generates an interactive world map that plots the location of each sample. Point color corresponds to the dominant genetic cluster, and point size reflects the sample count, allowing readers to zoom, pan, and hover for detailed site information.
Code
# 1. Data Prep
location_coordinates <- data.frame( #Adding location coordinates for the map
Sampling_Location = c(
"Caribbean, Carrie Bow Cay", "Caribbean, Nalunega, San Blas", "Easter Island", "Guam", "Gulf Stream Off Fort Pierce","Hawaii", "Indonesia", "Japan", "Kiribati", "Kwazulu-Natal", "Madagascar", "Marquesas", "Marquesas Islands", "Papua New Guinea", "Philippines", "Reunion", "Tanzania", "Tokelau", "Tonga", "Vanuatu"),
Latitude = c(16.70, 9.56, -27.12, 13.44, 27.46, 20.75, -4.00, 33.00, 1.42, -29.88, -19.00, -9.00, -9.00, -6.00, 12.87, -21.11, -6.80, -9.00, -20.00, -15.00),
Longitude = c(-88.06, -79.03, -109.35, 144.79, -80.19, -156.00, 118.00, 135.00, -173.00, 31.02, 47.00, -139.00, -139.00, 145.00, 121.77, 55.48, 39.00, -172.00, -175.00, 167.00))
# 2. Making a plot to make the map
plot_data <- final_cluster_assignments %>%
left_join(location_coordinates, by = "Sampling_Location") %>%
left_join( #Get sample count
Tripneustes_C01_data %>%
mutate(
Sampling_Location = case_when(
Sampling_Location == "Reunion (France)" ~ "Reunion",
Sampling_Location == "Kiritimati" ~ "Kiribati",
Sampling_Location == "South Africa" ~ "Kwazulu-Natal",
Sampling_Location == "Hawai'i" ~ "Hawaii",
Sampling_Location == "French_Polynesia" ~ "French Polynesia",TRUE ~ Sampling_Location)) %>%
count(Sampling_Location, name = "sample_count"), by = "Sampling_Location") %>%# Final formatting
mutate(Genetic_Cluster = factor(Genetic_Cluster, levels = c("1", "2", "3"))) %>%
filter(!is.na(Latitude)) # Remove locations without coordinates
# 3. Get world map
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
# 4. Make map
urchin_map_k3 <- ggplot(data = world) +
geom_sf(fill = "antiquewhite", color = "gray50") + # Draw landmasses
coord_sf(crs = "+proj=longlat +datum=WGS84",
xlim = c(-180, 180), # Define map area (Indo-Pacific/Indian Ocean focus)
ylim = c(-40, 40),
expand = FALSE) +
geom_point(data = plot_data, # Add points for sampling locations
aes(x = Longitude, y = Latitude,
color = Genetic_Cluster,
size = sample_count,
text = paste("Location:", Sampling_Location,
"<br>Cluster:", Genetic_Cluster,
"<br>Samples:", sample_count)),
alpha = 0.8) +
ggplot2::scale_color_manual(
values = k3_colors,
name = "Genetic Cluster (K=3)") +
scale_size_continuous(name = "Sample Count", range = c(3, 8)) + # Scale point size range
ggrepel::geom_text_repel(data = plot_data, # Add labels with ggrepel to prevent overlap
aes(x = Longitude, y = Latitude, label = Sampling_Location, color = Genetic_Cluster), size = 2.5, fontface = "bold",
nudge_y = 1.5, # Nudge labels slightly up
segment.color = 'grey50', # Use grey lines to connect labels to points
show.legend = FALSE, max.overlaps = Inf) + # Allow all labels to be plotted
labs(title = "Geographic Distribution of Urchin Genetic Clusters (K=3)",
subtitle = "Displaying 13 sites with CO1 DAPC results.",
x = "Longitude", y = "Latitude") + # Set titles and axis labels
theme_bw() + # Apply black and white theme
theme(panel.background = element_rect(fill = "lightblue", color = "black"), # Set ocean color to light blue
panel.grid.major = element_line(color = "gray70", linetype = "dotted"),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "bottom",
plot.margin = unit(c(0.5, 1, 0.5, 1), "cm"))
# 5. Save (Saving the static image)
ggplot2::ggsave(filename = here::here("Output", "urchin_map_k3_final.png"),
plot = urchin_map_k3,
width = 18,
height = 4)
# 6. Display the map interactively
plotly::ggplotly(urchin_map_k3, tooltip = c("text"))Figure 2. Geographic Distribution of Genetic Clusters (\(K=3\)) for Tripneustes gratilla. The map displays the location and sample count (point size) for the 13 sites included in the Discriminant Analysis of Principal Components (DAPC) using mitochondrial CO1 sequence data. Colors correspond to the three distinct genetic clusters identified in the analysis. Note the isolation of the South African population (Cluster 1, orange) from the widespread Indian Ocean/Indo-Pacific cluster (Cluster 2, green) and the Central/Western Pacific cluster (Cluster 3, blue).
Conclusion: Genetic structure in Tripneustes gratilla
This analysis of 222 CO1 sequences confirms population genetic structure in the sea urchin Tripneustes gratilla across its sampled Indo-Pacific range, clearly identifying three distinct genetic clusters (\(K=3\)). The structure may be due to major oceanographic barriers and isolation by distance that restrict larval dispersal across the vast species range.